Trees and Nets
Kerry Back
BUSI 520, Fall 2022
JGSB, Rice University
Decision tree
- Split the sample successively into smaller subsamples by answering “yes-no” questions.
- Each question is based on a single variable: is it above a threshold?
- Split a specified number of times (depth). Final subsamples are called leaves.
- For regression problems, the prediction for each observation is the mean of the training observations that end up in the same leaf.
- For classification problems, the prediction is the category that occurs most frequently in the leaf.
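A minimal scikit-learn sketch of fitting such a tree (the toy data and variable names are illustrative assumptions, not from the course):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: 200 observations, 3 predictors (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# max_depth limits the number of successive splits; each leaf
# predicts the mean of the training observations that land in it
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)
predictions = tree.predict(X)
```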
Example of a tree
Decision tree splitting for regression
- At each step, the algorithm chooses the variable and threshold that give the best split according to a criterion.
- In a regression problem, the standard criterion is SSE, the sum of squared errors (see the sketch after this list).
- Error = actual minus mean of subset
- Outliers tend to get separated into their own subsets (less so for SAE, the sum of absolute errors)
- There are also other relatively standard criteria; scikit-learn, for example, also offers friedman_mse, absolute_error, and poisson.
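A minimal numpy sketch of the SSE criterion for one candidate split (the function name and toy data are assumptions for illustration):

```python
import numpy as np

def split_sse(x, y, threshold):
    """SSE of splitting on x <= threshold: within each subset,
    error = actual - mean of subset."""
    sse = 0.0
    for subset in (y[x <= threshold], y[x > threshold]):
        if len(subset) > 0:
            sse += np.sum((subset - subset.mean()) ** 2)
    return sse

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 0.9, 3.0, 3.2])
print(split_sse(x, y, 2.5))  # 0.04: this threshold separates the two groups well
```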
Decision tree splitting for classification
- Let \(k\) = number of classes. Suppose a split results in \(n\) observations in a subset. Let \(x_i=\) fraction in class \(i\).
- The default criterion for scikit-learn is Gini impurity \(= 1 - \sum_{i=1}^k x_i^2\)
- If we consider \(\sum_{i=1}^k x_i^2\) subject to \(\sum_{i=1}^k x_i=1\), the value is minimized at \(x_i=1/k\) and maximized at the boundaries: some \(x_i=1\) and the others equal 0.
- So, minimizing Gini impurity means trying to create “pure” subsets (all of the same class).
- There are also other standard criteria, for example entropy and log loss.
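A small numpy sketch of Gini impurity (the function name is an assumption for illustration):

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_i x_i^2, where x_i is the fraction
    of the observations that falls in class i."""
    _, counts = np.unique(labels, return_counts=True)
    fractions = counts / counts.sum()
    return 1.0 - np.sum(fractions ** 2)

print(gini(["a", "a", "a", "a"]))  # 0.0: a pure subset
print(gini(["a", "a", "b", "b"]))  # 0.5: maximally mixed for two classes
```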
Random forest
- Multiple trees, each fit to a randomized version of the data
- Data for each tree is a bootstrapped sample:
- random selection of rows (with replacement)
- same size as original sample
- Prediction is the average of the individual trees’ predictions (for classification, the most frequently predicted class)
- Hyperparameters = number of trees and depth of trees
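A minimal scikit-learn sketch (the toy data and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.1, size=500)

# The two main hyperparameters: n_estimators = number of trees,
# max_depth = depth of each tree. Each tree is fit to its own
# bootstrapped sample, and predictions are averaged across trees.
forest = RandomForestRegressor(n_estimators=200, max_depth=4)
forest.fit(X, y)
predictions = forest.predict(X)
```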
Boosting
- Boosting means combining models, for example trees.
- The prediction is the sum of the individual models’ predictions.
- Gradient boosting fits a second model to the errors of the first, a third to the errors of the first two combined, …
- Adaptive boosting (AdaBoost) refits the original target but overweights the observations that prior models got wrong
- The learning rate (the weight put on each new model) is a hyperparameter
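A hand-rolled sketch of gradient boosting with squared error, to make the residual-fitting loop explicit (the function and its defaults are illustrative assumptions; scikit-learn's GradientBoostingRegressor packages the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=2):
    """Fit each new tree to the residuals of the running prediction;
    learning_rate is the weight put on each new tree."""
    prediction = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction  # errors of the combined model so far
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, prediction
```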
Multi-layer perceptrons
- A multi-layer perceptron (MLP) consists of “neurons” arranged in layers.
- A neuron is a mathematical function. It takes inputs \(x_1, \ldots, x_n\), calculates \(y=f(x_1, \ldots, x_n)\), and passes \(y\) to the neurons in the next layer.
- The inputs in the first layer are the predictors.
- The inputs in successive layers are the calculations from the prior layer.
- The last layer is a single neuron that produces the output.
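A minimal scikit-learn sketch of fitting an MLP (the toy data and layer sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.maximum(X[:, 0], 0) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

# hidden_layer_sizes gives the number of neurons in each hidden
# layer; the final layer is a single neuron producing the output
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000)
mlp.fit(X, y)
predictions = mlp.predict(X)
```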
Illustration
- input is \(x \in \mathbb{R}^4\)
- functions \(f_1, \ldots, f_5\) of \(x\) are calculated (called “hidden layer”)
- output is \(g(f_1(x), \ldots, f_5(x))\)
Rectified linear units
- The usual function for the neurons (except in the last layer) is \[ y = \max(0,\, b+w_1x_1 + \cdots + w_nx_n)\] The parameters \(b\) (called the bias) and \(w_1, \ldots, w_n\) (called the weights) are different for different neurons.
- This function is called a rectified linear unit (ReLU).
- The last layer uses a linear function \[ y = b+w_1x_1 + \cdots + w_nx_n\]
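A numpy sketch of the illustrated network — input \(x \in \mathbb{R}^4\), five hidden ReLU neurons, one linear output neuron — with randomly chosen weights (in practice the weights are fit by training):

```python
import numpy as np

def relu_layer(x, W, b):
    """One layer of ReLU neurons: y_j = max(0, b_j + w_j . x)."""
    return np.maximum(0.0, b + W @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                 # input x in R^4
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)   # 5 hidden neurons
w2, b2 = rng.normal(size=5), rng.normal()              # linear output neuron

hidden = relu_layer(x, W1, b1)   # f_1(x), ..., f_5(x)
output = b2 + w2 @ hidden        # g(f_1(x), ..., f_5(x))
```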
Analogy to neurons firing
If \(w_i>0\) and \(b<0\), then \(y>0\) only when the \(x_i\) are large enough.
A neuron “fires” when it is sufficiently stimulated by signals from the neurons in the prior layer.
Deep learning
- Deep learning means a neural network with many layers. It is behind facial recognition, self-driving cars, …
- You need a specialized library, probably TensorFlow (from Google) or PyTorch (from Facebook), and you probably need graphics processing units (GPUs) – i.e., to run on video cards. A minimal example is sketched after the reading list below.
- Hands-On Machine Learning with Scikit-Learn and TensorFlow
- Deep Learning for Coders with fastai and PyTorch (also fastai website)
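A minimal TensorFlow/Keras sketch of a small deep network (assumes TensorFlow is installed; the toy data and layer sizes are illustrative, and PyTorch code would be analogous):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)).astype("float32")
y = (X[:, 0] * X[:, 1]).astype("float32")

# A stack of hidden ReLU layers followed by a linear output layer
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```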